Binarized Convolutional Neural Networks with Separable Filters for Efficient Hardware Acceleration
State-of-the-art convolutional neural networks are enormously costly in both
compute and memory, demanding massively parallel GPUs for execution. Such
networks strain the computational capabilities and energy available to embedded
and mobile processing platforms, restricting their use in many important
applications. In this paper, we push the boundaries of hardware-effective CNN
design by proposing BCNN with Separable Filters (BCNNw/SF), which applies
Singular Value Decomposition (SVD) on BCNN kernels to further reduce
computational and storage complexity. To enable its implementation, we provide
a closed form of the gradient over SVD to calculate the exact gradient with
respect to every binarized weight in backward propagation. We verify BCNNw/SF
on the MNIST, CIFAR-10, and SVHN datasets, and implement an accelerator for
CIFAR-10 on FPGA hardware. Our BCNNw/SF accelerator realizes memory savings of
17% and execution time reduction of 31.3% compared to BCNN with only minor
accuracy sacrifices. Comment: 9 pages, 6 figures, accepted for Embedded Vision Workshop (CVPRW).
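To make the separable-filter idea concrete, below is a minimal NumPy sketch of the forward-path factorization: a binarized k x k filter is approximated by its rank-1 SVD, so one 2-D convolution can be replaced by two cheaper 1-D passes. The helper names are illustrative, and the paper's closed-form gradient through the SVD (its key training contribution) is omitted here.

```python
import numpy as np

def separable_approx(W):
    """Rank-1 (separable) approximation of a 2-D filter via SVD.

    A k x k filter W is factored as W ~= s * u @ v, so the 2-D
    convolution can be replaced by a k x 1 column pass followed
    by a 1 x k row pass.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    u = U[:, :1]            # k x 1 column filter
    v = Vt[:1, :]           # 1 x k row filter
    return S[0] * u, v      # fold the singular value into one factor

def binarize(W):
    """Sign binarization used by BCNNs (+1 / -1 weights)."""
    return np.where(W >= 0, 1.0, -1.0)

# Toy example: binarize a filter, then approximate it with a
# separable (rank-1) pair of 1-D filters.
rng = np.random.default_rng(0)
W_b = binarize(rng.standard_normal((3, 3)))
col, row = separable_approx(W_b)
print("rank-1 reconstruction error:", np.linalg.norm(W_b - col @ row))
```

Storing the two 1-D factors instead of the full k x k kernel is where the savings come from: per output pixel, k^2 multiply-accumulates drop to 2k, and the weight storage shrinks correspondingly.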
Analysis and Optimization of GNN-Based Recommender Systems on Persistent Memory
Graph neural networks (GNNs), which have emerged as an effective method for
handling machine learning tasks on graphs, bring a new approach to building
recommender systems, where the task of recommendation can be formulated as the
link prediction problem on user-item bipartite graphs. Training GNN-based
recommender systems (GNNRecSys) on large graphs incurs a large memory
footprint, easily exceeding the DRAM capacity on a typical server. Existing
solutions resort to distributed subgraph training, which is inefficient due to
the high cost of dynamically constructing subgraphs and significant redundancy
across subgraphs.
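As a minimal illustration of this link-prediction formulation (NumPy only, with made-up sizes and no learned parameters), one round of bipartite message passing followed by dot-product scoring looks like:

```python
import numpy as np

# Recommendation as link prediction on a user-item bipartite graph
# (illustrative sketch, not the paper's model). A is the |U| x |I|
# interaction matrix; one round of degree-normalized message passing
# updates both sides, and a recommendation score for (user, item)
# is the dot product of their resulting embeddings.
rng = np.random.default_rng(0)
n_users, n_items, d = 4, 5, 8
A = (rng.random((n_users, n_items)) < 0.3).astype(float)

H_u = rng.standard_normal((n_users, d))   # initial user embeddings
H_i = rng.standard_normal((n_items, d))   # initial item embeddings

deg_u = np.maximum(A.sum(1, keepdims=True), 1)
deg_i = np.maximum(A.sum(0, keepdims=True).T, 1)
H_u_new = A @ H_i / deg_u                 # users aggregate item messages
H_i_new = A.T @ H_u / deg_i               # items aggregate user messages

scores = H_u_new @ H_i_new.T              # predicted link scores
print("top item for user 0:", scores[0].argmax())
```

The two sparse-matrix products above are exactly the kind of memory-access-intensive kernels the characterization below is concerned with.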
The emerging persistent memory technologies provide a significantly larger
memory capacity than DRAMs at an affordable cost, making single-machine
GNNRecSys training feasible, which eliminates the inefficiencies in distributed
training. One major concern of using persistent memory devices for GNNRecSys is
their relatively low bandwidth compared with DRAMs. This limitation can be
particularly detrimental to achieving high performance for GNNRecSys workloads
since their dominant compute kernels are sparse and memory-access intensive. To
understand whether persistent memory is a good fit for GNNRecSys training, we
perform an in-depth characterization of GNNRecSys workloads and a comprehensive
analysis of their performance on a persistent memory device, namely, Intel
Optane. Based on the analysis, we provide guidance on how to configure Optane
for GNNRecSys workloads. Furthermore, we present techniques for large-batch
training to fully realize the advantages of single-machine GNNRecSys training.
Our experiment results show that with the tuned batch size and optimal system
configuration, Optane-based single-machine GNNRecSys training outperforms
distributed training by a large margin, especially when handling deep GNN
models.
MgX: Near-Zero Overhead Memory Protection with an Application to Secure DNN Acceleration
In this paper, we propose MgX, a near-zero overhead memory protection scheme
for hardware accelerators. MgX minimizes the performance overhead of off-chip
memory encryption and integrity verification by exploiting the
application-specific aspect of accelerators. Accelerators tend to explicitly
manage data movement between on-chip and off-chip memory, typically at an
object granularity that is much larger than cache lines. Exploiting these
accelerator-specific characteristics, MgX generates version numbers used in
memory encryption and integrity verification only using on-chip state without
storing them in memory, and also customizes the granularity of the memory
protection to match the granularity used by the accelerator. To demonstrate the
applicability of MgX, we present an in-depth study of MgX for deep neural
network (DNN) acceleration and also describe implementations for H.264 video
decoding and
genome alignment. Experimental results show that applying MgX has less than 1%
performance overhead for both DNN inference and training on state-of-the-art
DNN architectures.
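The core trick can be sketched as counter-mode encryption whose version (counter) is recomputed from on-chip state rather than fetched from memory. The toy below uses a SHA-256 keystream in place of AES and hypothetical object/version bookkeeping; it is illustrative only, not MgX's actual hardware design.

```python
import hashlib

def keystream(key, obj_id, version, length):
    """Toy counter-mode keystream derived from (key, object, version).

    In an MgX-style scheme, the version is regenerated from on-chip
    state (e.g., the layer/iteration that last wrote the object), so
    no per-block counters need to be stored in off-chip memory.
    """
    out = b""
    block = 0
    while len(out) < length:
        out += hashlib.sha256(
            key + obj_id.to_bytes(4, "little")
            + version.to_bytes(8, "little")
            + block.to_bytes(4, "little")).digest()
        block += 1
    return out[:length]

def xor(data, ks):
    return bytes(d ^ k for d, k in zip(data, ks))

key = b"\x00" * 16
plaintext = b"layer-7 activations"
# Version = number of times the accelerator has written this object,
# tracked on chip; here it is simply the training iteration.
version = 42
ct = xor(plaintext, keystream(key, 7, version, len(plaintext)))
pt = xor(ct, keystream(key, 7, version, len(ct)))
assert pt == plaintext
```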
GuardNN: Secure DNN Accelerator for Privacy-Preserving Deep Learning
This paper proposes GuardNN, a secure deep neural network (DNN) accelerator,
which provides strong hardware-based protection for user data and model
parameters even in an untrusted environment. GuardNN shows that the
architecture and protection can be customized for a specific application to
provide strong confidentiality and integrity protection with negligible
overhead. The design of the GuardNN instruction set reduces the TCB to just the
accelerator and enables confidentiality protection without the overhead of
integrity protection. GuardNN also introduces a new application-specific memory
protection scheme to minimize the overhead of memory encryption and integrity
verification. The scheme shows that most of the off-chip meta-data in today's
state-of-the-art memory protection can be removed by exploiting the known
memory access patterns of a DNN accelerator. GuardNN is implemented as an FPGA
prototype, which demonstrates effective protection with less than 2%
performance overhead for inference over a variety of modern DNN models.
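A toy sketch of why known access patterns shrink the metadata: if each off-chip object is written as a whole at a point the accelerator already knows, a single per-object MAC held on chip can stand in for per-cache-line counters and hash trees. The object/version scheme below is an assumption for illustration, using Python's standard hmac module.

```python
import hmac, hashlib

# With a DNN accelerator's fixed access pattern, each off-chip object
# (weights, activations) is written once per version and read whole,
# so one on-chip MAC per object suffices for integrity (illustrative).
key = b"\x01" * 16

def mac(obj_id, version, data):
    msg = (obj_id.to_bytes(4, "little")
           + version.to_bytes(8, "little") + data)
    return hmac.new(key, msg, hashlib.sha256).digest()

weights = b"quantized layer-3 weights"
tag = mac(obj_id=3, version=1, data=weights)   # kept on chip
# On read-back, recompute and compare; no off-chip metadata needed.
assert hmac.compare_digest(tag, mac(3, 1, weights))
```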
Decoupled Model Schedule for Deep Learning Training
Recent years have seen an increase in the development of large deep learning
(DL) models, which makes training efficiency crucial. Common practice
struggles with the trade-off between usability and performance. On one hand,
DL frameworks such as PyTorch use dynamic graphs to facilitate model developers
at the price of sub-optimal model training performance. On the other hand,
practitioners propose various approaches to improving the training efficiency
by sacrificing some of the flexibility, ranging from making the graph static
for more thorough optimization (e.g., XLA) to customizing optimization towards
large-scale distributed training (e.g., DeepSpeed and Megatron-LM).
In this paper, we aim to address the tension between usability and training
efficiency through separation of concerns. Inspired by DL compilers that
decouple the platform-specific optimizations of a tensor-level operator from
its arithmetic definition, this paper proposes a schedule language to decouple
model execution from definition. Specifically, the schedule works on a PyTorch
model and uses a set of schedule primitives to convert the model for common
model training optimizations such as high-performance kernels, effective 3D
parallelism, and efficient activation checkpointing. Compared to existing
optimization solutions, we optimize the model as needed through high-level
primitives, thereby preserving programmability and debuggability for users to
a large extent. Our evaluation results show that by scheduling the existing
hand-crafted optimizations in a systematic way, we are able to improve training
throughput by up to 3.35x on a single machine with 8 NVIDIA V100 GPUs, and by
up to 1.32x on multiple machines with up to 64 GPUs, when compared to the
out-of-the-box performance of DeepSpeed and Megatron-LM.
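To show the flavor of scheduling a model after it is defined, here is a self-contained sketch; the Schedule class and its replace/checkpoint primitives are hypothetical stand-ins for the paper's schedule language, built on plain PyTorch.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Schedule:
    """Minimal sketch of a decoupled model schedule.

    The model definition stays untouched; optimizations are applied
    afterwards through named primitives. The names and primitives
    here are illustrative assumptions, not the paper's actual API.
    """
    def __init__(self, model):
        self.model = model

    def _resolve(self, name):
        parent = self.model
        *path, leaf = name.split(".")
        for p in path:
            parent = getattr(parent, p)
        return parent, leaf

    def replace(self, name, new_module):
        """Swap a submodule (e.g., for a high-performance kernel)."""
        parent, leaf = self._resolve(name)
        setattr(parent, leaf, new_module)

    def checkpoint(self, name):
        """Wrap a submodule with activation checkpointing."""
        parent, leaf = self._resolve(name)
        inner = getattr(parent, leaf)

        class Ckpt(nn.Module):
            def __init__(self, mod):
                super().__init__()
                self.mod = mod
            def forward(self, x):
                # Recompute activations in the backward pass.
                return checkpoint(self.mod, x, use_reentrant=False)

        setattr(parent, leaf, Ckpt(inner))

# The definition is plain PyTorch; the schedule mutates it afterwards.
model = nn.Sequential(nn.Linear(16, 16), nn.GELU(), nn.Linear(16, 16))
sch = Schedule(model)
sch.replace("1", nn.ReLU())   # e.g., substitute a cheaper activation
sch.checkpoint("2")           # trade recompute for activation memory
out = model(torch.randn(4, 16, requires_grad=True))
out.sum().backward()
```

Because the definition itself is untouched, the unoptimized model remains runnable and debuggable, and each optimization is applied only where it pays off.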